
One zarr to rule them all #12

Merged: 21 commits into main, Apr 22, 2024
Conversation


@sadamov sadamov commented Mar 26, 2024

This PR simplifies the dataloader and input-data creation.
New features

  • create_zarr_archive.py now generates one huge zarr archive directly from the COSMO-2 GRIB files. @cosunae this might be of interest for offline data preparation.
  • weather_dataset.py now loads one zarr archive, and __getitem__ now also returns the datetimes of the current batch.
  • In ar_model.py the datetimes of the batch are compared to constants.EVAL_DATETIMES, which massively simplifies model testing and prediction. @clechartre this is certainly relevant for prediction/verification.
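To illustrate the second bullet, here is a minimal stdlib-only sketch of a dataset whose __getitem__ returns the datetimes alongside the sample. All names (WeatherDatasetSketch, timestep_h, sample_len) and the datetime format are hypothetical stand-ins, not the actual weather_dataset.py API:

```python
from datetime import datetime, timedelta

class WeatherDatasetSketch:
    """Sketch: maps a sample index to the datetimes that sample covers.

    In the real dataset the sample itself would be sliced lazily from the
    zarr archive; here the data is a placeholder and only the datetime
    bookkeeping is shown.
    """

    def __init__(self, start, n_timesteps, timestep_h=1, sample_len=3):
        self.start = start                         # first analysis time
        self.timestep = timedelta(hours=timestep_h)
        self.sample_len = sample_len               # timesteps per sample
        self.n_samples = n_timesteps - sample_len + 1

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        times = [self.start + (idx + k) * self.timestep
                 for k in range(self.sample_len)]
        sample = None  # placeholder for the actual data tensor
        return sample, [t.strftime("%Y%m%d%H") for t in times]
```

Returning the datetimes with each batch is what lets downstream code (e.g. the evaluation logic in ar_model.py) decide what to do with a batch without re-deriving its position in the archive.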

Reasoning

  • After discussions within ECMWF/Neural-LAM it became pretty clear that a single zarr archive is the way to go. Lazy loading and parallelization allow for huge archives. The reason it didn't work for me last year was most likely the issues we had with /scratch being extremely slow and unresponsive. It's also much easier to share dataloaders and datasets this way.
  • Verification of the model is much simpler now that the datetime of each batch is tracked. Model predictions can still be exported easily using constants.STORE_EXAMPLE_DATA.
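The datetime-based verification described above can be sketched in a few lines. EVAL_DATETIMES, the "%Y%m%d%H" string format, and the function name are illustrative assumptions, not the actual constants.py contents:

```python
# Hypothetical stand-in for constants.EVAL_DATETIMES: the analysis times
# at which evaluation/prediction output should be produced.
EVAL_DATETIMES = {"2020010112", "2020010212"}

def should_evaluate(batch_datetimes):
    """Return True if any datetime in the batch is flagged for evaluation.

    Comparing the batch's own datetimes against a fixed set replaces any
    fragile index arithmetic over the dataset.
    """
    return any(t in EVAL_DATETIMES for t in batch_datetimes)
```

A set membership test keeps the check O(1) per timestep regardless of how many evaluation dates are configured.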

Notes

  • Zarr creation is ongoing; it takes roughly 3-4 days, provided Balfrin does not crash.
  • For now I have created a small cosmo_single dataset that can be used for training and evaluation.
  • A new dummy example.ckpt was uploaded for evaluation/testing (trained on a very small dataset).
  • It might make sense to wait for the full zarr creation and retraining of the model before this PR is merged, but I wanted to share the code already to prevent duplicated work.

@sadamov sadamov requested a review from twicki March 26, 2024 15:34
@sadamov sadamov mentioned this pull request Apr 11, 2024
@sadamov sadamov requested a review from clechartre April 16, 2024 07:55

sadamov commented Apr 16, 2024

Okay, this PR is ready for review and merge. I have copied @cosunae's latest single zarr to balfrin.cscs.ch and updated the code accordingly. All zarr-creation scripts are now removed from the repo. The new example file is trained on a few timesteps, just for debugging. I suggest doing the following before merging:

  • @twicki can you execute slurm_train.py and slurm_eval.py and make sure they work?
  • @clechartre can you execute slurm_predict.py and cli_plotting.py and make sure everything is still as you wanted it to be?
    If you have additional comments on the new functionality, you are very welcome to add them to your review.

@clechartre clechartre left a comment

This code is functional for inference and prediction plotting.

Pinning PyTorch to a specific version was required because
otherwise CPU-only variants would be installed on some systems
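A pin of this kind typically looks like the following requirements fragment. The exact version and index URL below are illustrative, not the ones used in this repo:

```
# requirements.txt sketch: pin torch and point pip at a CUDA build index
# so that the CPU-only wheel is never silently selected.
--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.0.1
```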
@sadamov sadamov merged commit 22cddcb into main Apr 22, 2024
1 check passed
@sadamov sadamov deleted the one_zarr branch April 22, 2024 09:32